For this project, I will be using the Universal Workflow introduced in Deep Learning With Python 4.5. While I am aware that this book mainly covers deep learning techniques focusing on neural networks, I believe this workflow to be extendible to traditional machine learning algorithms as well.
Defining the problem: Clearly define the problem, and understand the task at hand, the available data, and the outcome desired.
Preparing the data: Transform raw data into a form that is appropriate for use in a deep learning model. This may include data cleaning, normalization, encoding, splitting into training/validation/test sets, etc.
The following steps will then be repeated for each machine learning model that will be explored in this project.
Defining the model: Choose an appropriate architecture for the problem, including the number of layers, the types of layers, the activation functions, etc.
Compiling the model: For neural networks, specify the optimizer, the loss function, and the metrics that will be used to evaluate the model during training.
Training the model: Train the model on the training data set.
Evaluating the model: Evaluate the performance of the model on the test set to estimate its real-world performance.
Tuning the model: If the performance is not satisfactory, adjust the hyperparameters, and repeat the previous steps until a satisfactory model is obtained.
Using the model: Deploy the trained model on new data to make predictions or classifications.
To this end, I will be using the Breast Cancer Wisconsin (Diagnostic) dataset from Kaggle, which consists of data computed from digitised images of tumour cells extracted by Fine Needle Aspiration.
The main measures of success by which we will be judging our model will be recall, followed by accuracy. This is because, for this problem, the cost of false negatives is higher than the cost of false positives, as a misdiagnosis of a tumour as benign may result in delays in treatment, leading to the cancer progressing, and becoming harder to treat. However, while that is the case, the costs of false positives are not negligible either, and cannot be ignored, as it can lead to costly treatments, additional tests, and emotional stress inflicted upon the patients. Therefore, recall, which reflects the percentage of false negatives, will be our first priority as a metric for success.
# General Use Libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Machine Learning Libraries
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import recall_score, accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
# Neural Network Libraries
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D,Flatten,Dense,Dropout
# Number that will be used for random seeds, to ensure replicable results
state = 73
from numpy.random import seed
from tensorflow.keras.utils import set_random_seed
seed(state)
set_random_seed(state)
# Loading the dataset
df = pd.read_csv('data.csv')
df.head()
| id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | NaN |
5 rows × 33 columns
# Get brief overview of datatypes, as well as check for any null entries
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 569 entries, 0 to 568 Data columns (total 33 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 569 non-null int64 1 diagnosis 569 non-null object 2 radius_mean 569 non-null float64 3 texture_mean 569 non-null float64 4 perimeter_mean 569 non-null float64 5 area_mean 569 non-null float64 6 smoothness_mean 569 non-null float64 7 compactness_mean 569 non-null float64 8 concavity_mean 569 non-null float64 9 concave points_mean 569 non-null float64 10 symmetry_mean 569 non-null float64 11 fractal_dimension_mean 569 non-null float64 12 radius_se 569 non-null float64 13 texture_se 569 non-null float64 14 perimeter_se 569 non-null float64 15 area_se 569 non-null float64 16 smoothness_se 569 non-null float64 17 compactness_se 569 non-null float64 18 concavity_se 569 non-null float64 19 concave points_se 569 non-null float64 20 symmetry_se 569 non-null float64 21 fractal_dimension_se 569 non-null float64 22 radius_worst 569 non-null float64 23 texture_worst 569 non-null float64 24 perimeter_worst 569 non-null float64 25 area_worst 569 non-null float64 26 smoothness_worst 569 non-null float64 27 compactness_worst 569 non-null float64 28 concavity_worst 569 non-null float64 29 concave points_worst 569 non-null float64 30 symmetry_worst 569 non-null float64 31 fractal_dimension_worst 569 non-null float64 32 Unnamed: 32 0 non-null float64 dtypes: float64(31), int64(1), object(1) memory usage: 146.8+ KB
As we can see, the dataset was imported into the notebook with an empty column, likely due to a missing comma in the CSV file. Therefore, the first step to take is to drop that column.
# Drop empty column
df = df.drop(['Unnamed: 32'], axis = 1)
# Check new shape of dataframe
df.shape
(569, 32)
# Get statistical overview of columns
df.describe()
| id | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5.690000e+02 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | ... | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 |
| mean | 3.037183e+07 | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.096360 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | ... | 16.269190 | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 |
| std | 1.250206e+08 | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.079720 | 0.038803 | 0.027414 | ... | 4.833242 | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 |
| min | 8.670000e+03 | 6.981000 | 9.710000 | 43.790000 | 143.500000 | 0.052630 | 0.019380 | 0.000000 | 0.000000 | 0.106000 | ... | 7.930000 | 12.020000 | 50.410000 | 185.200000 | 0.071170 | 0.027290 | 0.000000 | 0.000000 | 0.156500 | 0.055040 |
| 25% | 8.692180e+05 | 11.700000 | 16.170000 | 75.170000 | 420.300000 | 0.086370 | 0.064920 | 0.029560 | 0.020310 | 0.161900 | ... | 13.010000 | 21.080000 | 84.110000 | 515.300000 | 0.116600 | 0.147200 | 0.114500 | 0.064930 | 0.250400 | 0.071460 |
| 50% | 9.060240e+05 | 13.370000 | 18.840000 | 86.240000 | 551.100000 | 0.095870 | 0.092630 | 0.061540 | 0.033500 | 0.179200 | ... | 14.970000 | 25.410000 | 97.660000 | 686.500000 | 0.131300 | 0.211900 | 0.226700 | 0.099930 | 0.282200 | 0.080040 |
| 75% | 8.813129e+06 | 15.780000 | 21.800000 | 104.100000 | 782.700000 | 0.105300 | 0.130400 | 0.130700 | 0.074000 | 0.195700 | ... | 18.790000 | 29.720000 | 125.400000 | 1084.000000 | 0.146000 | 0.339100 | 0.382900 | 0.161400 | 0.317900 | 0.092080 |
| max | 9.113205e+08 | 28.110000 | 39.280000 | 188.500000 | 2501.000000 | 0.163400 | 0.345400 | 0.426800 | 0.201200 | 0.304000 | ... | 36.040000 | 49.540000 | 251.200000 | 4254.000000 | 0.222600 | 1.058000 | 1.252000 | 0.291000 | 0.663800 | 0.207500 |
8 rows × 31 columns
This dataset includes an id column, a diagnosis column, and three sets of data columns - the mean of each feature for each sample, the worst, or largest mean value, of each feature, and the standard error for each feature. These are denoted by the columns titled [feature]_mean, [feature[_worst, and [feature]_serespectively.
Among these, the diagnosis column will act as the class column for the classification task, while the id column will serve as the identification column, which will largely be unused in the couurse of this project.
The remaining 30 columns, on the other hand, are columns that reflect measurements of physical characteristics of the tumoours, and therefore are more likely to be indicative of whether the tumour is malignant or benign. This will be the pool of potential features to be used in the classification of the tumours.
# Feature standard error columns
se_features = [x for x in df.columns if '_se' in x]
se_df = df[se_features]
# Feature mean columns
mean_features = [x for x in df.columns if '_mean' in x]
mean_df = df[mean_features]
# Feature worst columns
worst_features = [x for x in df.columns if '_worst' in x]
worst_df = df[worst_features]
class_counts = df['diagnosis'].value_counts(ascending=False).values
sns.countplot(data=df,x='diagnosis')
<AxesSubplot:xlabel='diagnosis', ylabel='count'>
# Helper function to create pairplot for features
def plotGrid(df, y, features):
features.append(y)
df_plot = df[features]
sns.pairplot(data = df_plot, hue = y)
# Plot all mean features
plotGrid(df, 'diagnosis', mean_features)
# Plot all worst features
plotGrid(df, 'diagnosis', worst_features)
# Plot all feature standard errors
plotGrid(df, 'diagnosis', se_features)
From the pairplots, we can tell that not all the features available in the dataset are especially relevant in determining if a tumour is benign or malignant. Some of the graphs show that the particular feature has a similar distribution of results whether the tumour is benign or malignant, for example, the fractal_dimension_se feature.
fig, ax = plt.subplots(ncols=5, nrows=2, figsize=(16,8))
ax = ax.flatten()
for idx, col in enumerate(mean_df.columns):
sns.histplot(data=mean_df, x=col,kde=True,ax=ax[idx])
plt.tight_layout(pad=0.5,h_pad=0.8,w_pad=0.5)
fig, ax = plt.subplots(ncols=5, nrows=2, figsize=(16,8))
ax = ax.flatten()
for idx, col in enumerate(worst_df.columns):
sns.histplot(data=worst_df, x=col,kde=True,ax=ax[idx])
plt.tight_layout(pad=0.5,h_pad=0.8,w_pad=0.5)
fig, ax = plt.subplots(ncols=5, nrows=2, figsize=(16,8))
ax = ax.flatten()
for idx, col in enumerate(se_df.columns):
sns.histplot(data=se_df, x=col,kde=True,ax=ax[idx])
plt.tight_layout(pad=0.5,h_pad=0.8,w_pad=0.5)
# Label Encoding
encoder = LabelEncoder()
df['diagnosis'] = encoder.fit_transform(df['diagnosis'])
y = df[['diagnosis']]
y
| diagnosis | |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| ... | ... |
| 564 | 1 |
| 565 | 1 |
| 566 | 1 |
| 567 | 1 |
| 568 | 0 |
569 rows × 1 columns
First, we scale the model using the minmax scaler sklearn provides. This scaler was chosen as the output of the data will all be between 0 and 1, which works for our purposes as some of the machine learning models cannot accomodate negative input data.
# Get features and scale
scaler = MinMaxScaler()
x = df.drop(['id', 'diagnosis'], axis = 1)
scaler.fit(x)
x = scaler.transform(x)
x
array([[0.52103744, 0.0226581 , 0.54598853, ..., 0.91202749, 0.59846245,
0.41886396],
[0.64314449, 0.27257355, 0.61578329, ..., 0.63917526, 0.23358959,
0.22287813],
[0.60149557, 0.3902604 , 0.59574321, ..., 0.83505155, 0.40370589,
0.21343303],
...,
[0.45525108, 0.62123774, 0.44578813, ..., 0.48728522, 0.12872068,
0.1519087 ],
[0.64456434, 0.66351031, 0.66553797, ..., 0.91065292, 0.49714173,
0.45231536],
[0.03686876, 0.50152181, 0.02853984, ..., 0. , 0.25744136,
0.10068215]])
# Split the dataset into training, testing, and validation sets, at a ratio of 2:7:1
X_train,X_test,y_train,y_test = train_test_split(x,y,test_size = 0.2, random_state = state)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.125, random_state=state)
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)
y_val = np.ravel(y_val)
X_train.shape, y_train.shape, X_test.shape, y_test.shape, X_val.shape, y_val.shape
((398, 30), (398,), (114, 30), (114,), (57, 30), (57,))
Earlier, it was established that not all features may be helpful in solving this problem. Therefore, the more relevant featurers will have to be picked out for use. However, conjecture based on visual information is insufficient to disqualify features, and thus we will be using the 'SelectKBest' module available in scikit-learn, which calculates the chi-squared value for the features and uses it to select the k best features for use with machine learning models.
# Find 10 best scored features
n_features=10
select_feature = SelectKBest(chi2, k=n_features).fit(X_train, y_train)
X_train_selected = select_feature.transform(X_train)
X_val_selected = select_feature.transform(X_val)
X_test_selected = select_feature.transform(X_test)
X_train_selected.shape, X_val_selected.shape, X_test_selected.shape
((398, 10), (57, 10), (114, 10))
We will use a pandas dataframe to store the final results.
index = 1
results_df = pd.DataFrame(columns=['Index','Model Name','Training Set Recall','Training Set Accuracy','Testing Set Recall','Testing Set Accuracy'])
results_df
| Index | Model Name | Training Set Recall | Training Set Accuracy | Testing Set Recall | Testing Set Accuracy |
|---|
# Helper function to insert results into results dataframe
def insert_results(name, r1, r2, r3, r4):
global index, results_df
results_df = pd.concat([results_df, pd.DataFrame({'Index': [index], 'Model Name': [name],
'Training Set Recall':[r1],'Training Set Accuracy':[r2],'Testing Set Recall':[r3],'Testing Set Accuracy':[r4]})])
index += 1
The first machine learning model we will be evaluating will be the K-Nearest Neighbours algorithm. The K-Nearest Neighbours Algorithm classifies entries by looking at each data point and searching for the k nearest data points to the data point, then decides which class they are by the majority classification among those k points. First, we will run the model on default parameters.
knn_model = KNeighborsClassifier()
knn_model.fit(X_train_selected,y_train)
knn_model.get_params()
{'algorithm': 'auto',
'leaf_size': 30,
'metric': 'minkowski',
'metric_params': None,
'n_jobs': None,
'n_neighbors': 5,
'p': 2,
'weights': 'uniform'}
# Training set results
y_pred = knn_model.predict(X_train_selected)
recall = recall_score(y_train, y_pred)
accuracy = accuracy_score(y_train, y_pred)
(recall, accuracy)
(0.9103448275862069, 0.957286432160804)
# Testing set results
y_pred = knn_model.predict(X_test_selected)
recall = recall_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
(recall, accuracy)
(0.9130434782608695, 0.956140350877193)
Now, we will use grid search cross-validation to tune the hyper parameters we have. We do this by compiling a reasonable parameter grid - in this case, we will try different values for n_neighbours, the number of neighbours that the KNN algorithm will take into consideration, and the weights system, which determines how heavily different data points are weighted.
parameters = {"n_neighbors":np.linspace(1,10,10).astype(int), "weights":["uniform","distance"]}
knn_optimised = GridSearchCV(knn_model, parameters, cv=5,scoring="recall")
knn_optimised.fit(X_train_selected, y_train)
knn_optimised.best_estimator_.get_params()
{'algorithm': 'auto',
'leaf_size': 30,
'metric': 'minkowski',
'metric_params': None,
'n_jobs': None,
'n_neighbors': 1,
'p': 2,
'weights': 'uniform'}
As we can see, the grid search has returned a model where the n_neighbours hyperparameter has changed. Now, we will test the new model by evaluating its performance against the training and testing set.
# Training set results
y_pred = knn_optimised.predict(X_train_selected)
recall = recall_score(y_train, y_pred)
accuracy = accuracy_score(y_train, y_pred)
(recall, accuracy)
(1.0, 1.0)
# Testing set results
y_pred = knn_optimised.predict(X_test_selected)
recall = recall_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
(recall, accuracy)
(0.9130434782608695, 0.9385964912280702)
The performance of the model on the training set has increased to 100%. This suggests that the model may be overfitted. However, the performance on the testing set has gotten worse, with the accuracy decreasing from 95.6% to 93.9%. In this case, we should repeat the tuning on the hyperparameters, but avoid the change made with this tuning, as it seems to have caused the model to become overfitted.
parameters = {"n_neighbors":np.linspace(2,11,10).astype(int), "weights":["uniform","distance"]}
knn_optimised = GridSearchCV(knn_model, parameters, cv=5,scoring="recall")
knn_optimised.fit(X_train_selected, y_train)
knn_optimised.best_estimator_.get_params()
{'algorithm': 'auto',
'leaf_size': 30,
'metric': 'minkowski',
'metric_params': None,
'n_jobs': None,
'n_neighbors': 2,
'p': 2,
'weights': 'distance'}
# Testing set results
y_pred = knn_optimised.predict(X_train_selected)
recall = recall_score(y_train, y_pred)
accuracy = accuracy_score(y_train, y_pred)
(recall, accuracy)
(1.0, 1.0)
# Testing set results
y_pred = knn_optimised.predict(X_test_selected)
recall = recall_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
(recall, accuracy)
(0.9130434782608695, 0.9385964912280702)
parameters = {"n_neighbors":np.linspace(3,12,10).astype(int), "weights":["uniform","distance"]}
knn_optimised = GridSearchCV(knn_model, parameters, cv=5,scoring="recall")
knn_optimised.fit(X_train_selected, y_train)
knn_optimised.best_estimator_.get_params()
{'algorithm': 'auto',
'leaf_size': 30,
'metric': 'minkowski',
'metric_params': None,
'n_jobs': None,
'n_neighbors': 3,
'p': 2,
'weights': 'uniform'}
y_pred = knn_optimised.predict(X_train_selected)
train_recall = recall_score(y_train, y_pred)
train_accuracy = accuracy_score(y_train, y_pred)
(train_recall, train_accuracy)
(0.9379310344827586, 0.964824120603015)
y_pred = knn_optimised.predict(X_test_selected)
test_recall = recall_score(y_test, y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
(0.9565217391304348, 0.9649122807017544)
insert_results("K-Nearest Neighbours", train_recall, train_accuracy, test_recall, test_accuracy)
After repeating the grid search twice, avoiding the values of 1 and 2 on the n_neighbours hyperparameter as they cause the model to be overfitted, we find that n_neighbours = 3 causes the model to perform much better than any of the previous configurations on the testing set. While it also causes the performance on the training set to worsen, that indicates that the model is not overfitted, and can generalise better.
The next machine learning model we will be evaluating will be a Support Vector Classifier. Support Vector Classifiers use hyperplanes to define decision boundaries, which they use to solve classification problems in high dimensiona spaces. First, we will run the model on default parameters.
svc_model = SVC(gamma='auto', class_weight={0: class_counts[0], 1: class_counts[1]}, random_state=state)
svc_model.fit(X_train_selected,y_train)
svc_model.get_params()
{'C': 1.0,
'break_ties': False,
'cache_size': 200,
'class_weight': {0: 357, 1: 212},
'coef0': 0.0,
'decision_function_shape': 'ovr',
'degree': 3,
'gamma': 'auto',
'kernel': 'rbf',
'max_iter': -1,
'probability': False,
'random_state': 73,
'shrinking': True,
'tol': 0.001,
'verbose': False}
y_pred = svc_model.predict(X_train_selected)
recall = recall_score(y_train, y_pred)
accuracy = accuracy_score(y_train, y_pred)
(recall, accuracy)
(0.8758620689655172, 0.9547738693467337)
y_pred = svc_model.predict(X_test_selected)
recall = recall_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
(recall, accuracy)
(0.8913043478260869, 0.956140350877193)
cm = confusion_matrix(y_test, y_pred)
cm
array([[68, 0],
[ 5, 41]], dtype=int64)
We can see that the support vector classified performed worse than the K Nearest Neighbours algorithm under default parameters.
Next, we will again use grid search cross-validation to tune the hyper parameters we have. In this case, we will try different values for C, the regularisation parameter, which controls the margin of error allowed to the support vector classifier in the construction of the hyperplane, coef0, the coefficient, and the kernel, the type of function used to construct the hyperplane.
parameters = {'C': np.logspace(-3, 3, 100),'kernel': ['linear', 'sigmoid'],
'gamma':['scale', 'auto'],'coef0':np.linspace(0, 10, 10).astype(int)}
svc_optimised = RandomizedSearchCV(svc_model, parameters,scoring="recall",random_state=state)
svc_optimised.fit(X_train_selected, y_train)
svc_optimised.best_estimator_.get_params()
{'C': 657.9332246575682,
'break_ties': False,
'cache_size': 200,
'class_weight': {0: 357, 1: 212},
'coef0': 4,
'decision_function_shape': 'ovr',
'degree': 3,
'gamma': 'scale',
'kernel': 'linear',
'max_iter': -1,
'probability': False,
'random_state': 73,
'shrinking': True,
'tol': 0.001,
'verbose': False}
y_pred = svc_optimised.predict(X_train_selected)
train_recall = recall_score(y_train, y_pred)
train_accuracy = accuracy_score(y_train, y_pred)
(train_recall, train_accuracy)
(0.9448275862068966, 0.9748743718592965)
y_pred = svc_optimised.predict(X_test_selected)
test_recall = recall_score(y_test, y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
(0.9130434782608695, 0.956140350877193)
insert_results("Support Vector Classifier", train_recall, train_accuracy, test_recall, test_accuracy)
In this case, grid search gave a configuration for the model that sees a marked improvement in the performace of the model, bumping the up performance scores of the model on both the training and testing sets, though the accuracy of the model on the testing set remained constant.
Next, we will use the Random Forest Classifier to tackle this problem. The Random Forest Classifier is a machine learning algorithm that combines multiple decision trees to make predictions on input data provided to it. Firstly, we will test it on the problem with default parameters, only giving it the weights of the classes to go off of.
rf_model = RandomForestClassifier(
random_state = state,
class_weight={0: class_counts[0], 1: class_counts[1]})
rf_model.fit(X_train_selected,y_train)
rf_model.get_params()
{'bootstrap': True,
'ccp_alpha': 0.0,
'class_weight': {0: 357, 1: 212},
'criterion': 'gini',
'max_depth': None,
'max_features': 'auto',
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 100,
'n_jobs': None,
'oob_score': False,
'random_state': 73,
'verbose': 0,
'warm_start': False}
y_pred = rf_model.predict(X_train_selected)
recall = recall_score(y_train, y_pred)
accuracy = accuracy_score(y_train, y_pred)
(recall, accuracy)
(1.0, 1.0)
y_pred = rf_model.predict(X_test_selected)
recall = recall_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
(recall, accuracy)
(0.8913043478260869, 0.9385964912280702)
As we can see, the Random Forest Classifier achieves a 100% recall on default parameters. This could be a sign that the model is overfitted, and we will need to move on to the standard hyperparameter tuning. However, unlike the previous models, we cannot just use GridSearchCV as the model is already performing at 100% for the training set, and thus there is no further room for improvement by GridSearchCV's standards.
With that in mind, the approach will be to perform manual hyperparameter tuning, focusing on trying to reduce overfitting. The first thing we can try to reduce overfitting will be to lower the number of estimators in the random forest from the default value of 100. We will first try to establish a lower bound for the value of n_estimators by dramatically reducing the value until the recall score of the random forest classifier on the training set drops below 100%, or the recall score on the test set drops below the initial value of 0.891.
rf_model = RandomForestClassifier(
random_state = state,
n_estimators = 20,
class_weight={0: class_counts[0], 1: class_counts[1]})
rf_model.fit(X_train_selected,y_train)
rf_model.get_params()
{'bootstrap': True,
'ccp_alpha': 0.0,
'class_weight': {0: 357, 1: 212},
'criterion': 'gini',
'max_depth': None,
'max_features': 'auto',
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 20,
'n_jobs': None,
'oob_score': False,
'random_state': 73,
'verbose': 0,
'warm_start': False}
y_pred = rf_model.predict(X_train_selected)
recall = recall_score(y_train, y_pred)
accuracy = accuracy_score(y_train, y_pred)
(recall, accuracy)
(1.0, 1.0)
y_pred = rf_model.predict(X_test_selected)
recall = recall_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
(recall, accuracy)
(0.8913043478260869, 0.9298245614035088)
rf_model = RandomForestClassifier(
random_state = state,
n_estimators = 5,
class_weight={0: class_counts[0], 1: class_counts[1]})
rf_model.fit(X_train_selected,y_train)
rf_model.get_params()
{'bootstrap': True,
'ccp_alpha': 0.0,
'class_weight': {0: 357, 1: 212},
'criterion': 'gini',
'max_depth': None,
'max_features': 'auto',
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 5,
'n_jobs': None,
'oob_score': False,
'random_state': 73,
'verbose': 0,
'warm_start': False}
y_pred = rf_model.predict(X_train_selected)
recall = recall_score(y_train, y_pred)
accuracy = accuracy_score(y_train, y_pred)
(recall, accuracy)
(0.9517241379310345, 0.9798994974874372)
y_pred = rf_model.predict(X_test_selected)
recall = recall_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
(recall, accuracy)
(0.9130434782608695, 0.9385964912280702)
In this case, reducing the number of estimators to 5 reduces the recall score on the training set, but caused the recall score of the testing set to increase. This indicates that the model has gotten better at generalising its results, and is no longer as overfittedd as before. However, this does not mean that it is now optimised, though it does mean that we can now employ methods like GridSearchCV to try to optimise its parameters.
parameters = {'min_samples_leaf': [1,2,3,4], 'n_estimators': np.linspace(2,8,num=7).astype(int), 'max_depth': [10,20,None]}
rf_optimised = GridSearchCV(rf_model, parameters,scoring="recall", refit=True)
rf_optimised.fit(X_train_selected, y_train)
rf_optimised.best_estimator_.get_params()
{'bootstrap': True,
'ccp_alpha': 0.0,
'class_weight': {0: 357, 1: 212},
'criterion': 'gini',
'max_depth': 10,
'max_features': 'auto',
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 3,
'n_jobs': None,
'oob_score': False,
'random_state': 73,
'verbose': 0,
'warm_start': False}
y_pred = rf_optimised.predict(X_train_selected)
train_recall = recall_score(y_train, y_pred)
train_accuracy = accuracy_score(y_train, y_pred)
(train_recall, train_accuracy)
(0.9586206896551724, 0.9798994974874372)
y_pred = rf_optimised.predict(X_test_selected)
test_recall = recall_score(y_test, y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
(0.9565217391304348, 0.9473684210526315)
insert_results("Random Forest Classifier", train_recall, train_accuracy, test_recall, test_accuracy)
In this case, the grid search cross validation similarly boosted scores across the board, giving us a significant increase in both the recall and accuracy scores on the testing set.
The next machine learning model that will be explored is the Naive Bayes classifier. The Naive Bayes classifier is a machine learning algorithm that uses Bayes' theorem to classify a data point, that assumes that features are independent.
# Build a Gaussian Classifier
nb_model = GaussianNB()
# Model training
nb_model.fit(X_train_selected,y_train)
nb_model.get_params()
{'priors': None, 'var_smoothing': 1e-09}
y_pred = nb_model.predict(X_train_selected)
recall = recall_score(y_train, y_pred)
accuracy = accuracy_score(y_train, y_pred)
(recall, accuracy)
(0.9172413793103448, 0.9422110552763819)
y_pred = nb_model.predict(X_test_selected)
recall = recall_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
(recall, accuracy)
(0.9565217391304348, 0.956140350877193)
Now for hyperparameter testing. The Naive Bayes classifier does not have many hyperparameters that require tuning, only var_smoothing, which is the variable that adds small value to the variances of the features.
parameters = {'var_smoothing': np.logspace(0,-10, num=100)}
nb_optimised = GridSearchCV(nb_model, parameters,scoring="recall")
nb_optimised.fit(X_train_selected, y_train)
nb_optimised.best_estimator_.get_params()
{'priors': None, 'var_smoothing': 0.01519911082952934}
y_pred = nb_optimised.predict(X_train_selected)
recall = recall_score(y_train, y_pred)
accuracy = accuracy_score(y_train, y_pred)
(recall, accuracy)
(0.903448275862069, 0.9371859296482412)
y_pred = nb_optimised.predict(X_test_selected)
recall = recall_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
(recall, accuracy)
(0.9347826086956522, 0.9473684210526315)
As we can see, a larger value of var_smoothing seems to make the model perform worse. With that in mind, we can retry the tuning using a range of numbers that center on the default value.
parameters = {'var_smoothing': np.logspace(-4,-14, num=100)}
nb_optimised = GridSearchCV(nb_model, parameters,scoring="recall")
nb_optimised.fit(X_train_selected, y_train)
nb_optimised.best_estimator_.get_params()
{'priors': None, 'var_smoothing': 0.0001}
y_pred = nb_optimised.predict(X_train_selected)
train_recall = recall_score(y_train, y_pred)
train_accuracy = accuracy_score(y_train, y_pred)
(train_recall, train_accuracy)
(0.9172413793103448, 0.9422110552763819)
y_pred = nb_optimised.predict(X_test_selected)
test_recall = recall_score(y_test, y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
(0.9565217391304348, 0.956140350877193)
insert_results("Naive Bayes Classifier", train_recall, train_accuracy, test_recall, test_accuracy)
In this case, it would seem that the default value for the hyperparameter performed the best.
The final machine learning algorithm we will be looking at is logistic regression. Logistic regression is a statistical model that estimates the probability of a data point belonging to each class by fitting a logistic function to the data.
lr_model = LogisticRegression(class_weight={0: class_counts[0], 1: class_counts[1]}, random_state=state, max_iter=5000)
lr_model.fit(X_train_selected,y_train)
lr_model.get_params()
{'C': 1.0,
'class_weight': {0: 357, 1: 212},
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'l1_ratio': None,
'max_iter': 5000,
'multi_class': 'auto',
'n_jobs': None,
'penalty': 'l2',
'random_state': 73,
'solver': 'lbfgs',
'tol': 0.0001,
'verbose': 0,
'warm_start': False}
y_pred = lr_model.predict(X_train_selected)
recall = recall_score(y_train, y_pred)
accuracy = accuracy_score(y_train, y_pred)
(recall, accuracy)
(0.9310344827586207, 0.9723618090452262)
y_pred = lr_model.predict(X_test_selected)
recall = recall_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
(recall, accuracy)
(0.9130434782608695, 0.956140350877193)
In the case of logistic regression, the hyperparameters that need to be looked at have a larger range of values than the previous models. This means that grid search cross validation will take a bit more time to run its course than for the others. We will need to look at a value of C, the regularisation parameter, penalty, the type of regularisation, and solver, different optimisation algorithms for fitting the model.
parameters = {'C': np.logspace(-2, 2, 99)
,'penalty': ['l1', 'l2', 'elasticnet']
,'solver':['lbfgs'
, 'liblinear'
, 'newton-cg'
, 'newton-cholesky'
]
}
lr_optimised = GridSearchCV(lr_model, parameters,scoring="recall")
lr_optimised.fit(X_train_selected, y_train)
lr_optimised.best_estimator_.get_params()
F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py:372: FitFailedWarning:
3960 fits failed out of a total of 5940.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
Below are more details about the failures:
--------------------------------------------------------------------------------
495 fits failed with the following error:
Traceback (most recent call last):
File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
solver = _check_solver(self.solver, self.penalty, self.dual)
File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 447, in _check_solver
raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.
--------------------------------------------------------------------------------
495 fits failed with the following error:
Traceback (most recent call last):
File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
solver = _check_solver(self.solver, self.penalty, self.dual)
File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 447, in _check_solver
raise ValueError(
ValueError: Solver newton-cg supports only 'l2' or 'none' penalties, got l1 penalty.
--------------------------------------------------------------------------------
1485 fits failed with the following error:
Traceback (most recent call last):
File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
solver = _check_solver(self.solver, self.penalty, self.dual)
File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 434, in _check_solver
raise ValueError(
ValueError: Logistic Regression supports only solvers in ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga'], got newton-cholesky.
--------------------------------------------------------------------------------
495 fits failed with the following error:
Traceback (most recent call last):
File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
solver = _check_solver(self.solver, self.penalty, self.dual)
File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 447, in _check_solver
raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got elasticnet penalty.
--------------------------------------------------------------------------------
495 fits failed with the following error:
Traceback (most recent call last):
File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
solver = _check_solver(self.solver, self.penalty, self.dual)
File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 457, in _check_solver
raise ValueError(
ValueError: Only 'saga' solver supports elasticnet penalty, got solver=liblinear.
--------------------------------------------------------------------------------
495 fits failed with the following error:
Traceback (most recent call last):
File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
solver = _check_solver(self.solver, self.penalty, self.dual)
File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 447, in _check_solver
raise ValueError(
ValueError: Solver newton-cg supports only 'l2' or 'none' penalties, got elasticnet penalty.
warnings.warn(some_fits_failed_message, FitFailedWarning)
F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\model_selection\_search.py:969: UserWarning: One or more of the test scores are non-finite: [ nan 0.86206897 nan ... nan nan nan]
warnings.warn(
{'C': 0.05963623316594643,
'class_weight': {0: 357, 1: 212},
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'l1_ratio': None,
'max_iter': 5000,
'multi_class': 'auto',
'n_jobs': None,
'penalty': 'l1',
'random_state': 73,
'solver': 'liblinear',
'tol': 0.0001,
'verbose': 0,
'warm_start': False}
In this case, we are getting many warnings. This is due to the fact that some hyperparameters are inherently incompatible with others - for example, the 'newton-cg' solver with the 'elasticnet' penalty. The algorithm ignores the incompatible permutations of hyperparameters.
y_pred = lr_optimised.predict(X_train_selected)
train_recall = recall_score(y_train, y_pred)
train_accuracy = accuracy_score(y_train, y_pred)
(train_recall, train_accuracy)
(0.9310344827586207, 0.9698492462311558)
y_pred = lr_optimised.predict(X_test_selected)
test_recall = recall_score(y_test, y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
(0.9130434782608695, 0.956140350877193)
y_pred = lr_model.predict(X_train_selected)
train_recall = recall_score(y_train, y_pred)
train_accuracy = accuracy_score(y_train, y_pred)
(train_recall, train_accuracy)
(0.9310344827586207, 0.9723618090452262)
y_pred = lr_model.predict(X_test_selected)
test_recall = recall_score(y_test, y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
(0.9130434782608695, 0.956140350877193)
insert_results("Logistic Regression", train_recall, train_accuracy, test_recall, test_accuracy)
In this case, it seems that the grid search cross-validation returned a value worse than that of the default parameters, therefore we will take the results obtained from the default.
Next, we will be experimenting with a single layer perceptron. A single layer perceptron is the simplest form of a neural network, comprising only one input layer and one output layer, with no hidden layers. We will be building this single layer perceptron with 256 neurons in the input layer.
model_0 = Sequential([
Dense(128,activation='relu',input_shape=(n_features,)),
Dense(1,activation='sigmoid')])
model_0.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 128) 1408
dense_1 (Dense) (None, 1) 129
=================================================================
Total params: 1,537
Trainable params: 1,537
Non-trainable params: 0
_________________________________________________________________
model_0.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),loss='binary_crossentropy',metrics=['Accuracy', 'Recall'])
n_epoch = 20
history_0 = model_0.fit(x=X_train_selected,y=y_train,
validation_data=(X_val_selected, y_val),
epochs=n_epoch)
Epoch 1/20 13/13 [==============================] - 2s 34ms/step - loss: 0.6883 - Accuracy: 0.7638 - recall: 0.3862 - val_loss: 0.6727 - val_Accuracy: 0.8947 - val_recall: 0.9524 Epoch 2/20 13/13 [==============================] - 0s 4ms/step - loss: 0.6583 - Accuracy: 0.8643 - recall: 0.9724 - val_loss: 0.6433 - val_Accuracy: 0.8596 - val_recall: 0.9524 Epoch 3/20 13/13 [==============================] - 0s 4ms/step - loss: 0.6292 - Accuracy: 0.8618 - recall: 0.9793 - val_loss: 0.6137 - val_Accuracy: 0.8947 - val_recall: 0.9524 Epoch 4/20 13/13 [==============================] - 0s 4ms/step - loss: 0.5983 - Accuracy: 0.8894 - recall: 0.9586 - val_loss: 0.5812 - val_Accuracy: 0.9474 - val_recall: 0.9524 Epoch 5/20 13/13 [==============================] - 0s 4ms/step - loss: 0.5653 - Accuracy: 0.9347 - recall: 0.9103 - val_loss: 0.5454 - val_Accuracy: 0.9298 - val_recall: 0.9048 Epoch 6/20 13/13 [==============================] - 0s 4ms/step - loss: 0.5287 - Accuracy: 0.9372 - recall: 0.9172 - val_loss: 0.5073 - val_Accuracy: 0.9474 - val_recall: 0.9524 Epoch 7/20 13/13 [==============================] - 0s 4ms/step - loss: 0.4901 - Accuracy: 0.9347 - recall: 0.8828 - val_loss: 0.4659 - val_Accuracy: 0.9474 - val_recall: 0.9048 Epoch 8/20 13/13 [==============================] - 0s 4ms/step - loss: 0.4503 - Accuracy: 0.9322 - recall: 0.8690 - val_loss: 0.4259 - val_Accuracy: 0.9474 - val_recall: 0.9048 Epoch 9/20 13/13 [==============================] - 0s 4ms/step - loss: 0.4128 - Accuracy: 0.9296 - recall: 0.8552 - val_loss: 0.3884 - val_Accuracy: 0.9474 - val_recall: 0.9048 Epoch 10/20 13/13 [==============================] - 0s 4ms/step - loss: 0.3784 - Accuracy: 0.9372 - recall: 0.8828 - val_loss: 0.3549 - val_Accuracy: 0.9474 - val_recall: 0.9048 Epoch 11/20 13/13 [==============================] - 0s 8ms/step - loss: 0.3471 - Accuracy: 0.9397 - recall: 0.8897 - val_loss: 0.3248 - val_Accuracy: 0.9474 - val_recall: 0.9048 Epoch 12/20 13/13 [==============================] - 0s 4ms/step - loss: 0.3198 - Accuracy: 0.9322 - recall: 0.8621 - val_loss: 0.2967 - val_Accuracy: 0.9649 - val_recall: 0.9048 Epoch 13/20 13/13 [==============================] - 0s 5ms/step - loss: 0.2963 - Accuracy: 0.9322 - recall: 0.8621 - val_loss: 0.2744 - val_Accuracy: 0.9474 - val_recall: 0.9048 Epoch 14/20 13/13 [==============================] - 0s 4ms/step - loss: 0.2758 - Accuracy: 0.9347 - recall: 0.8690 - val_loss: 0.2563 - val_Accuracy: 0.9474 - val_recall: 0.9048 Epoch 15/20 13/13 [==============================] - 0s 4ms/step - loss: 0.2587 - Accuracy: 0.9347 - recall: 0.8621 - val_loss: 0.2376 - val_Accuracy: 0.9474 - val_recall: 0.9048 Epoch 16/20 13/13 [==============================] - 0s 4ms/step - loss: 0.2444 - Accuracy: 0.9347 - recall: 0.8621 - val_loss: 0.2237 - val_Accuracy: 0.9474 - val_recall: 0.9048 Epoch 17/20 13/13 [==============================] - 0s 4ms/step - loss: 0.2317 - Accuracy: 0.9372 - recall: 0.8690 - val_loss: 0.2123 - val_Accuracy: 0.9474 - val_recall: 0.9048 Epoch 18/20 13/13 [==============================] - 0s 4ms/step - loss: 0.2227 - Accuracy: 0.9397 - recall: 0.8828 - val_loss: 0.2038 - val_Accuracy: 0.9474 - val_recall: 0.9048 Epoch 19/20 13/13 [==============================] - 0s 5ms/step - loss: 0.2131 - Accuracy: 0.9372 - recall: 0.8690 - val_loss: 0.1936 - val_Accuracy: 0.9474 - val_recall: 0.9048 Epoch 20/20 13/13 [==============================] - 0s 4ms/step - loss: 0.2060 - Accuracy: 0.9372 - recall: 0.8690 - val_loss: 0.1872 - val_Accuracy: 0.9474 - val_recall: 0.9048
After training the model, we will plot out the training data, and look to see if there
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history_0.history["Accuracy"], label="train_acc")
plt.plot(np.arange(0, n_epoch), history_0.history["val_Accuracy"], label="val_acc")
plt.title("Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Accuracy")
plt.legend()
<matplotlib.legend.Legend at 0x13846b0fa30>
From the accuracy graph, we can see that the training model has been fitted to the data quite well, as the training set accuracy and validation set accuracy are quite close to each other.
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history_0.history["loss"], label="train_loss")
plt.plot(np.arange(0, n_epoch), history_0.history["val_loss"], label="val_loss")
plt.title("Training Loss")
plt.xlabel("Epoch #")
plt.ylabel("Loss")
plt.legend()
<matplotlib.legend.Legend at 0x13846ffc070>
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history_0.history["recall"], label="train_recall")
plt.plot(np.arange(0, n_epoch), history_0.history["val_recall"], label="val_recall")
plt.title("Training Recall")
plt.xlabel("Epoch #")
plt.ylabel("Recall")
plt.legend()
<matplotlib.legend.Legend at 0x138482cbbe0>
y_pred = model_0.predict(X_test_selected)
y_pred = [1 if y > 0.5 else 0 for y in y_pred]
test_recall = recall_score(y_test,y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
4/4 [==============================] - 0s 1ms/step
(0.9347826086956522, 0.9473684210526315)
insert_results("Single Layer Perceptron", history_0.history["recall"][-1], history_0.history["Accuracy"][-1], test_recall, test_accuracy)
From this, we can see that the single layer perceptron exhibited a worse performance than some of the machine learning algorithms. It is likely that a multilayer perceptron, being more complex and containing regularisation techniques like dropout layers, will have a better performance.
Following that, we will be looking at the multilayer perceptron, which is a type of feedforward neural network comprising multiple layers. The one we will be building will be a simple one one input layer, one output layer, and one hidden layer. It will also contain dropout layers, which help the model to prevent overfitting.
model_1 = Sequential([
Dense(128,activation='relu',input_shape=(n_features,)),
Dropout(0.2),
Dense(64,activation='relu'),
Dropout(0.2),
Dense(1,activation='sigmoid')])
model_1.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_2 (Dense) (None, 128) 1408
dropout (Dropout) (None, 128) 0
dense_3 (Dense) (None, 64) 8256
dropout_1 (Dropout) (None, 64) 0
dense_4 (Dense) (None, 1) 65
=================================================================
Total params: 9,729
Trainable params: 9,729
Non-trainable params: 0
_________________________________________________________________
model_1.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),loss='binary_crossentropy',metrics=['Accuracy', 'Recall'])
n_epoch = 20
history = model_1.fit(x=X_train_selected,y=y_train,
validation_data=(X_val_selected, y_val),
epochs=n_epoch)
Epoch 1/20 13/13 [==============================] - 2s 44ms/step - loss: 0.6775 - Accuracy: 0.5779 - recall: 0.8345 - val_loss: 0.6532 - val_Accuracy: 0.4561 - val_recall: 1.0000 Epoch 2/20 13/13 [==============================] - 0s 7ms/step - loss: 0.6377 - Accuracy: 0.7085 - recall: 0.9655 - val_loss: 0.6042 - val_Accuracy: 0.7895 - val_recall: 0.9524 Epoch 3/20 13/13 [==============================] - 0s 4ms/step - loss: 0.5772 - Accuracy: 0.8693 - recall: 0.9448 - val_loss: 0.5307 - val_Accuracy: 0.9298 - val_recall: 0.9524 Epoch 4/20 13/13 [==============================] - 0s 5ms/step - loss: 0.5034 - Accuracy: 0.9221 - recall: 0.8690 - val_loss: 0.4389 - val_Accuracy: 0.9123 - val_recall: 0.8571 Epoch 5/20 13/13 [==============================] - 0s 4ms/step - loss: 0.4043 - Accuracy: 0.9372 - recall: 0.9034 - val_loss: 0.3386 - val_Accuracy: 0.9298 - val_recall: 0.9048 Epoch 6/20 13/13 [==============================] - 0s 4ms/step - loss: 0.3214 - Accuracy: 0.9246 - recall: 0.8759 - val_loss: 0.2573 - val_Accuracy: 0.9474 - val_recall: 0.8571 Epoch 7/20 13/13 [==============================] - 0s 4ms/step - loss: 0.2521 - Accuracy: 0.9246 - recall: 0.8345 - val_loss: 0.2057 - val_Accuracy: 0.9298 - val_recall: 0.9048 Epoch 8/20 13/13 [==============================] - 0s 4ms/step - loss: 0.2096 - Accuracy: 0.9422 - recall: 0.8759 - val_loss: 0.1733 - val_Accuracy: 0.9474 - val_recall: 0.9048 Epoch 9/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1885 - Accuracy: 0.9397 - recall: 0.8897 - val_loss: 0.1563 - val_Accuracy: 0.9474 - val_recall: 0.9048 Epoch 10/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1798 - Accuracy: 0.9296 - recall: 0.8621 - val_loss: 0.1462 - val_Accuracy: 0.9474 - val_recall: 0.9048 Epoch 11/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1718 - Accuracy: 0.9397 - recall: 0.9241 - val_loss: 0.1421 - val_Accuracy: 0.9474 - val_recall: 0.9048 Epoch 12/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1553 - Accuracy: 0.9422 - recall: 0.8828 - val_loss: 0.1383 - val_Accuracy: 0.9474 - val_recall: 0.9048 Epoch 13/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1503 - Accuracy: 0.9397 - recall: 0.8966 - val_loss: 0.1367 - val_Accuracy: 0.9474 - val_recall: 0.9048 Epoch 14/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1494 - Accuracy: 0.9447 - recall: 0.9034 - val_loss: 0.1367 - val_Accuracy: 0.9649 - val_recall: 0.9524 Epoch 15/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1461 - Accuracy: 0.9472 - recall: 0.9241 - val_loss: 0.1350 - val_Accuracy: 0.9474 - val_recall: 0.9048 Epoch 16/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1511 - Accuracy: 0.9372 - recall: 0.8759 - val_loss: 0.1354 - val_Accuracy: 0.9649 - val_recall: 0.9524 Epoch 17/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1386 - Accuracy: 0.9497 - recall: 0.8966 - val_loss: 0.1353 - val_Accuracy: 0.9649 - val_recall: 0.9524 Epoch 18/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1505 - Accuracy: 0.9322 - recall: 0.9379 - val_loss: 0.1351 - val_Accuracy: 0.9474 - val_recall: 0.9048 Epoch 19/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1360 - Accuracy: 0.9422 - recall: 0.8897 - val_loss: 0.1353 - val_Accuracy: 0.9649 - val_recall: 0.9524 Epoch 20/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1456 - Accuracy: 0.9497 - recall: 0.9172 - val_loss: 0.1369 - val_Accuracy: 0.9474 - val_recall: 0.9524
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["Accuracy"], label="train_acc")
plt.plot(np.arange(0, n_epoch), history.history["val_Accuracy"], label="val_acc")
plt.title("Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Accuracy")
plt.legend()
<matplotlib.legend.Legend at 0x1384ba65370>
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["loss"], label="train_loss")
plt.plot(np.arange(0, n_epoch), history.history["val_loss"], label="val_loss")
plt.title("Training Loss")
plt.xlabel("Epoch #")
plt.ylabel("Loss")
plt.legend()
<matplotlib.legend.Legend at 0x138493888b0>
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["recall"], label="train_recall")
plt.plot(np.arange(0, n_epoch), history.history["val_recall"], label="val_recall")
plt.title("Training Recall")
plt.xlabel("Epoch #")
plt.ylabel("Recall")
plt.legend()
<matplotlib.legend.Legend at 0x1384ac96e50>
y_pred = model_1.predict(X_test_selected)
y_pred = [1 if y > 0.5 else 0 for y in y_pred]
test_recall = recall_score(y_test,y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
4/4 [==============================] - 0s 1ms/step
(0.9565217391304348, 0.956140350877193)
insert_results("Multilayer Perceptron", history_0.history["recall"][-1], history_0.history["Accuracy"][-1], test_recall, test_accuracy)
Finally, we will be working with a convolutional neural network. A convolutional neural network (CNN), is a type of neural network that has a convolutional layer for its input layer, and specialises in processing data structured in grid form, like image data.
As a convolutional neural network requires 3-dimensional data as input, we will reshape our input data to add an extra dimension.
X_train_3d = X_train_selected.reshape(X_train_selected.shape[0], X_train_selected.shape[1], 1)
X_test_3d = X_test_selected.reshape(X_test_selected.shape[0], X_test_selected.shape[1], 1)
X_val_3d = X_val_selected.reshape(X_val_selected.shape[0], X_val_selected.shape[1], 1)
X_train_3d.shape, X_test_3d.shape, X_val_3d.shape
((398, 10, 1), (114, 10, 1), (57, 10, 1))
cnn_model = Sequential([
Conv1D(filters=32,kernel_size=2,activation='relu',input_shape=(n_features,1)),
Dropout(0.2),
Flatten(),
Dense(64, activation='relu'),
Dropout(0.3),
Dense(1,activation='sigmoid')
])
cnn_model.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d (Conv1D) (None, 9, 32) 96
dropout_2 (Dropout) (None, 9, 32) 0
flatten (Flatten) (None, 288) 0
dense_5 (Dense) (None, 64) 18496
dropout_3 (Dropout) (None, 64) 0
dense_6 (Dense) (None, 1) 65
=================================================================
Total params: 18,657
Trainable params: 18,657
Non-trainable params: 0
_________________________________________________________________
cnn_model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),loss='binary_crossentropy',metrics=['Recall','Accuracy'])
# Fit the model
history = cnn_model.fit(X_train_3d, y_train,
validation_data=(X_val_3d, y_val),
epochs=n_epoch)
Epoch 1/20 13/13 [==============================] - 2s 26ms/step - loss: 0.6691 - recall: 0.9655 - Accuracy: 0.4598 - val_loss: 0.6401 - val_recall: 1.0000 - val_Accuracy: 0.4561 Epoch 2/20 13/13 [==============================] - 0s 5ms/step - loss: 0.6215 - recall: 0.9862 - Accuracy: 0.6683 - val_loss: 0.5831 - val_recall: 0.9524 - val_Accuracy: 0.7895 Epoch 3/20 13/13 [==============================] - 0s 5ms/step - loss: 0.5577 - recall: 0.9586 - Accuracy: 0.8643 - val_loss: 0.4962 - val_recall: 0.9524 - val_Accuracy: 0.9123 Epoch 4/20 13/13 [==============================] - 0s 4ms/step - loss: 0.4684 - recall: 0.8690 - Accuracy: 0.9196 - val_loss: 0.3857 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 5/20 13/13 [==============================] - 0s 5ms/step - loss: 0.3668 - recall: 0.8690 - Accuracy: 0.9246 - val_loss: 0.2825 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 6/20 13/13 [==============================] - 0s 4ms/step - loss: 0.2752 - recall: 0.8690 - Accuracy: 0.9296 - val_loss: 0.2125 - val_recall: 0.9048 - val_Accuracy: 0.9649 Epoch 7/20 13/13 [==============================] - 0s 5ms/step - loss: 0.2327 - recall: 0.8345 - Accuracy: 0.9196 - val_loss: 0.1768 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 8/20 13/13 [==============================] - 0s 6ms/step - loss: 0.1946 - recall: 0.8897 - Accuracy: 0.9397 - val_loss: 0.1579 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 9/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1846 - recall: 0.9034 - Accuracy: 0.9372 - val_loss: 0.1508 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 10/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1751 - recall: 0.8759 - Accuracy: 0.9347 - val_loss: 0.1463 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 11/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1763 - recall: 0.9103 - Accuracy: 0.9347 - val_loss: 0.1443 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 12/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1625 - recall: 0.8621 - Accuracy: 0.9322 - val_loss: 0.1423 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 13/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1653 - recall: 0.8690 - Accuracy: 0.9347 - val_loss: 0.1421 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 14/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1475 - recall: 0.9034 - Accuracy: 0.9447 - val_loss: 0.1415 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 15/20 13/13 [==============================] - 0s 6ms/step - loss: 0.1580 - recall: 0.8897 - Accuracy: 0.9372 - val_loss: 0.1413 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 16/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1630 - recall: 0.8897 - Accuracy: 0.9397 - val_loss: 0.1412 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 17/20 13/13 [==============================] - 0s 7ms/step - loss: 0.1590 - recall: 0.8897 - Accuracy: 0.9322 - val_loss: 0.1427 - val_recall: 0.9048 - val_Accuracy: 0.9298 Epoch 18/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1560 - recall: 0.9379 - Accuracy: 0.9347 - val_loss: 0.1416 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 19/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1617 - recall: 0.8690 - Accuracy: 0.9246 - val_loss: 0.1417 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 20/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1640 - recall: 0.8759 - Accuracy: 0.9322 - val_loss: 0.1417 - val_recall: 0.9048 - val_Accuracy: 0.9298
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["Accuracy"], label="train_acc")
plt.plot(np.arange(0, n_epoch), history.history["val_Accuracy"], label="val_acc")
plt.title("Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Accuracy")
plt.legend()
<matplotlib.legend.Legend at 0x13846798190>
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["loss"], label="train_loss")
plt.plot(np.arange(0, n_epoch), history.history["val_loss"], label="val_loss")
plt.title("Training Loss")
plt.xlabel("Epoch #")
plt.ylabel("Loss")
plt.legend()
<matplotlib.legend.Legend at 0x13846782760>
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["recall"], label="train_recall")
plt.plot(np.arange(0, n_epoch), history.history["val_recall"], label="val_recall")
plt.title("Training Recall")
plt.xlabel("Epoch #")
plt.ylabel("Recall")
plt.legend()
<matplotlib.legend.Legend at 0x13846f7c250>
y_pred = cnn_model.predict(X_test_selected)
y_pred = [1 if y > 0.5 else 0 for y in y_pred]
test_recall = recall_score(y_test,y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(recall, accuracy)
4/4 [==============================] - 0s 1ms/step
(0.9130434782608695, 0.956140350877193)
cnn_model1 = Sequential([
Conv1D(filters=32,kernel_size=2,activation='relu',input_shape=(n_features,1)),
Dropout(0.5),
Flatten(),
Dense(64, activation='relu'),
Dropout(0.5),
Dense(1,activation='sigmoid')
])
cnn_model1.summary()
Model: "sequential_3"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d_1 (Conv1D) (None, 9, 32) 96
dropout_4 (Dropout) (None, 9, 32) 0
flatten_1 (Flatten) (None, 288) 0
dense_7 (Dense) (None, 64) 18496
dropout_5 (Dropout) (None, 64) 0
dense_8 (Dense) (None, 1) 65
=================================================================
Total params: 18,657
Trainable params: 18,657
Non-trainable params: 0
_________________________________________________________________
cnn_model1.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),loss='binary_crossentropy',metrics=['Recall','Accuracy'])
history = cnn_model1.fit(X_train_3d, y_train,
validation_data=(X_val_3d, y_val),
epochs=n_epoch)
Epoch 1/20 13/13 [==============================] - 2s 29ms/step - loss: 0.6868 - recall: 0.4966 - Accuracy: 0.6332 - val_loss: 0.6616 - val_recall: 1.0000 - val_Accuracy: 0.5789 Epoch 2/20 13/13 [==============================] - 0s 5ms/step - loss: 0.6513 - recall: 0.7862 - Accuracy: 0.7312 - val_loss: 0.6159 - val_recall: 0.9524 - val_Accuracy: 0.8070 Epoch 3/20 13/13 [==============================] - 0s 5ms/step - loss: 0.5989 - recall: 0.8276 - Accuracy: 0.8216 - val_loss: 0.5330 - val_recall: 0.9524 - val_Accuracy: 0.9298 Epoch 4/20 13/13 [==============================] - 0s 5ms/step - loss: 0.5107 - recall: 0.7379 - Accuracy: 0.8844 - val_loss: 0.4164 - val_recall: 0.8095 - val_Accuracy: 0.9298 Epoch 5/20 13/13 [==============================] - 0s 5ms/step - loss: 0.4076 - recall: 0.7379 - Accuracy: 0.8920 - val_loss: 0.3036 - val_recall: 0.9048 - val_Accuracy: 0.9649 Epoch 6/20 13/13 [==============================] - 0s 6ms/step - loss: 0.3030 - recall: 0.8483 - Accuracy: 0.9196 - val_loss: 0.2276 - val_recall: 0.9048 - val_Accuracy: 0.9649 Epoch 7/20 13/13 [==============================] - 0s 5ms/step - loss: 0.2442 - recall: 0.8207 - Accuracy: 0.9196 - val_loss: 0.1867 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 8/20 13/13 [==============================] - 0s 4ms/step - loss: 0.2289 - recall: 0.8207 - Accuracy: 0.9045 - val_loss: 0.1643 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 9/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1892 - recall: 0.8759 - Accuracy: 0.9422 - val_loss: 0.1586 - val_recall: 0.9048 - val_Accuracy: 0.9123 Epoch 10/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1915 - recall: 0.8828 - Accuracy: 0.9271 - val_loss: 0.1478 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 11/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1918 - recall: 0.8828 - Accuracy: 0.9271 - val_loss: 0.1456 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 12/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1735 - recall: 0.9034 - Accuracy: 0.9397 - val_loss: 0.1435 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 13/20 13/13 [==============================] - 0s 7ms/step - loss: 0.1907 - recall: 0.8552 - Accuracy: 0.9171 - val_loss: 0.1428 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 14/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1741 - recall: 0.8897 - Accuracy: 0.9322 - val_loss: 0.1441 - val_recall: 0.9048 - val_Accuracy: 0.9298 Epoch 15/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1723 - recall: 0.8897 - Accuracy: 0.9296 - val_loss: 0.1429 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 16/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1711 - recall: 0.8759 - Accuracy: 0.9372 - val_loss: 0.1427 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 17/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1759 - recall: 0.8759 - Accuracy: 0.9246 - val_loss: 0.1431 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 18/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1892 - recall: 0.9103 - Accuracy: 0.9372 - val_loss: 0.1440 - val_recall: 0.9048 - val_Accuracy: 0.9298 Epoch 19/20 13/13 [==============================] - 0s 6ms/step - loss: 0.1743 - recall: 0.8759 - Accuracy: 0.9246 - val_loss: 0.1424 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 20/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1627 - recall: 0.9034 - Accuracy: 0.9422 - val_loss: 0.1425 - val_recall: 0.9048 - val_Accuracy: 0.9474
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["Accuracy"], label="train_acc")
plt.plot(np.arange(0, n_epoch), history.history["val_Accuracy"], label="val_acc")
plt.title("Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Accuracy")
plt.legend()
<matplotlib.legend.Legend at 0x13846c4ee20>
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["loss"], label="train_loss")
plt.plot(np.arange(0, n_epoch), history.history["val_loss"], label="val_loss")
plt.title("Training Loss")
plt.xlabel("Epoch #")
plt.ylabel("Loss")
plt.legend()
<matplotlib.legend.Legend at 0x1384697e1c0>
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["recall"], label="train_recall")
plt.plot(np.arange(0, n_epoch), history.history["val_recall"], label="val_recall")
plt.title("Training Recall")
plt.xlabel("Epoch #")
plt.ylabel("Recall")
plt.legend()
<matplotlib.legend.Legend at 0x13846371d90>
y_pred = cnn_model1.predict(X_test_selected)
y_pred = [1 if y > 0.5 else 0 for y in y_pred]
test_recall = recall_score(y_test,y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
4/4 [==============================] - 0s 1ms/step
(0.9347826086956522, 0.956140350877193)
cnn_model2 = Sequential([
Conv1D(filters=32,kernel_size=2,activation='relu',input_shape=(n_features,1)),
Dropout(0.2),
Flatten(),
Dense(32, activation='relu'),
Dropout(0.2),
Dense(16, activation='relu'),
Dropout(0.2),
Dense(1,activation='sigmoid')
])
cnn_model2.summary()
Model: "sequential_9"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d_7 (Conv1D) (None, 9, 32) 96
dropout_18 (Dropout) (None, 9, 32) 0
flatten_7 (Flatten) (None, 288) 0
dense_21 (Dense) (None, 32) 9248
dropout_19 (Dropout) (None, 32) 0
dense_22 (Dense) (None, 16) 528
dropout_20 (Dropout) (None, 16) 0
dense_23 (Dense) (None, 1) 17
=================================================================
Total params: 9,889
Trainable params: 9,889
Non-trainable params: 0
_________________________________________________________________
cnn_model2.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01),loss='binary_crossentropy',metrics=['Recall','Accuracy'])
history = cnn_model2.fit(X_train_3d, y_train,
validation_data=(X_val_3d, y_val),
epochs=n_epoch)
Epoch 1/20 13/13 [==============================] - 2s 26ms/step - loss: 0.5871 - recall: 0.6414 - Accuracy: 0.8216 - val_loss: 0.2930 - val_recall: 0.9048 - val_Accuracy: 0.9123 Epoch 2/20 13/13 [==============================] - 0s 5ms/step - loss: 0.2461 - recall: 0.8276 - Accuracy: 0.9020 - val_loss: 0.1629 - val_recall: 0.9524 - val_Accuracy: 0.9298 Epoch 3/20 13/13 [==============================] - 0s 5ms/step - loss: 0.2088 - recall: 0.8966 - Accuracy: 0.9271 - val_loss: 0.1510 - val_recall: 0.8571 - val_Accuracy: 0.9474 Epoch 4/20 13/13 [==============================] - 0s 5ms/step - loss: 0.2039 - recall: 0.9103 - Accuracy: 0.9146 - val_loss: 0.2507 - val_recall: 0.8095 - val_Accuracy: 0.9298 Epoch 5/20 13/13 [==============================] - 0s 5ms/step - loss: 0.2832 - recall: 0.8345 - Accuracy: 0.8794 - val_loss: 0.1660 - val_recall: 0.9048 - val_Accuracy: 0.9123 Epoch 6/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1890 - recall: 0.8276 - Accuracy: 0.9171 - val_loss: 0.1447 - val_recall: 0.9048 - val_Accuracy: 0.9298 Epoch 7/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1684 - recall: 0.8897 - Accuracy: 0.9271 - val_loss: 0.1613 - val_recall: 0.9048 - val_Accuracy: 0.9123 Epoch 8/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1609 - recall: 0.8828 - Accuracy: 0.9422 - val_loss: 0.1818 - val_recall: 0.9524 - val_Accuracy: 0.9298 Epoch 9/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1749 - recall: 0.8897 - Accuracy: 0.9296 - val_loss: 0.1443 - val_recall: 0.9048 - val_Accuracy: 0.9298 Epoch 10/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1581 - recall: 0.9172 - Accuracy: 0.9397 - val_loss: 0.1479 - val_recall: 0.8571 - val_Accuracy: 0.9298 Epoch 11/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1574 - recall: 0.9172 - Accuracy: 0.9422 - val_loss: 0.1667 - val_recall: 0.8571 - val_Accuracy: 0.9474 Epoch 12/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1623 - recall: 0.9241 - Accuracy: 0.9497 - val_loss: 0.1447 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 13/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1351 - recall: 0.9034 - Accuracy: 0.9523 - val_loss: 0.1483 - val_recall: 0.9524 - val_Accuracy: 0.9474 Epoch 14/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1483 - recall: 0.8966 - Accuracy: 0.9296 - val_loss: 0.1453 - val_recall: 0.9524 - val_Accuracy: 0.9474 Epoch 15/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1289 - recall: 0.9241 - Accuracy: 0.9523 - val_loss: 0.1404 - val_recall: 0.9048 - val_Accuracy: 0.9298 Epoch 16/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1517 - recall: 0.8897 - Accuracy: 0.9372 - val_loss: 0.1435 - val_recall: 0.9524 - val_Accuracy: 0.9474 Epoch 17/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1231 - recall: 0.9241 - Accuracy: 0.9598 - val_loss: 0.1654 - val_recall: 0.9524 - val_Accuracy: 0.9298 Epoch 18/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1330 - recall: 0.9103 - Accuracy: 0.9472 - val_loss: 0.1405 - val_recall: 0.9524 - val_Accuracy: 0.9474 Epoch 19/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1344 - recall: 0.9379 - Accuracy: 0.9623 - val_loss: 0.1590 - val_recall: 0.8571 - val_Accuracy: 0.9474 Epoch 20/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1435 - recall: 0.9172 - Accuracy: 0.9497 - val_loss: 0.1407 - val_recall: 0.8571 - val_Accuracy: 0.9474
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["Accuracy"], label="train_acc")
plt.plot(np.arange(0, n_epoch), history.history["val_Accuracy"], label="val_acc")
plt.title("Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Accuracy")
plt.legend()
<matplotlib.legend.Legend at 0x1384632b8e0>
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["loss"], label="train_loss")
plt.plot(np.arange(0, n_epoch), history.history["val_loss"], label="val_loss")
plt.title("Training Loss")
plt.xlabel("Epoch #")
plt.ylabel("Loss")
plt.legend()
<matplotlib.legend.Legend at 0x13857238df0>
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["recall"], label="train_recall")
plt.plot(np.arange(0, n_epoch), history.history["val_recall"], label="val_recall")
plt.title("Training Recall")
plt.xlabel("Epoch #")
plt.ylabel("Recall")
plt.legend()
<matplotlib.legend.Legend at 0x1385837ed30>
y_pred = cnn_model2.predict(X_test_selected)
y_pred = [1 if y > 0.5 else 0 for y in y_pred]
test_recall = recall_score(y_test,y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
4/4 [==============================] - 0s 2ms/step
(0.9130434782608695, 0.9649122807017544)
cnn_model = Sequential([
Conv1D(filters=32,kernel_size=2,activation='relu',input_shape=(n_features,1)),
Dropout(0.2),
Flatten(),
Dense(64, activation='relu'),
Dropout(0.3),
Dense(1,activation='sigmoid')
])
cnn_model.summary()
Model: "sequential_7"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d_5 (Conv1D) (None, 9, 32) 96
dropout_13 (Dropout) (None, 9, 32) 0
flatten_5 (Flatten) (None, 288) 0
dense_16 (Dense) (None, 64) 18496
dropout_14 (Dropout) (None, 64) 0
dense_17 (Dense) (None, 1) 65
=================================================================
Total params: 18,657
Trainable params: 18,657
Non-trainable params: 0
_________________________________________________________________
cnn_model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),loss='binary_crossentropy',metrics=['Recall','Accuracy'])
# Fit the model
history = cnn_model.fit(X_train_3d, y_train,
validation_data=(X_val_3d, y_val),
epochs=n_epoch)
Epoch 1/20 13/13 [==============================] - 2s 27ms/step - loss: 0.6751 - recall: 0.8828 - Accuracy: 0.5176 - val_loss: 0.6533 - val_recall: 1.0000 - val_Accuracy: 0.4737 Epoch 2/20 13/13 [==============================] - 0s 5ms/step - loss: 0.6328 - recall: 0.9931 - Accuracy: 0.6834 - val_loss: 0.5941 - val_recall: 0.9524 - val_Accuracy: 0.8070 Epoch 3/20 13/13 [==============================] - 0s 5ms/step - loss: 0.5655 - recall: 0.9586 - Accuracy: 0.8518 - val_loss: 0.4981 - val_recall: 0.9524 - val_Accuracy: 0.9123 Epoch 4/20 13/13 [==============================] - 0s 5ms/step - loss: 0.4606 - recall: 0.8966 - Accuracy: 0.9322 - val_loss: 0.3827 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 5/20 13/13 [==============================] - 0s 5ms/step - loss: 0.3627 - recall: 0.8276 - Accuracy: 0.9146 - val_loss: 0.2745 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 6/20 13/13 [==============================] - 0s 5ms/step - loss: 0.2587 - recall: 0.8759 - Accuracy: 0.9397 - val_loss: 0.2032 - val_recall: 0.9048 - val_Accuracy: 0.9649 Epoch 7/20 13/13 [==============================] - 0s 5ms/step - loss: 0.2173 - recall: 0.8345 - Accuracy: 0.9221 - val_loss: 0.1687 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 8/20 13/13 [==============================] - 0s 7ms/step - loss: 0.1843 - recall: 0.8621 - Accuracy: 0.9372 - val_loss: 0.1519 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 9/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1816 - recall: 0.8828 - Accuracy: 0.9322 - val_loss: 0.1488 - val_recall: 0.9048 - val_Accuracy: 0.9298 Epoch 10/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1707 - recall: 0.8690 - Accuracy: 0.9271 - val_loss: 0.1438 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 11/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1722 - recall: 0.9172 - Accuracy: 0.9447 - val_loss: 0.1418 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 12/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1689 - recall: 0.8621 - Accuracy: 0.9322 - val_loss: 0.1408 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 13/20 13/13 [==============================] - 0s 6ms/step - loss: 0.1667 - recall: 0.8621 - Accuracy: 0.9296 - val_loss: 0.1412 - val_recall: 0.9048 - val_Accuracy: 0.9298 Epoch 14/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1623 - recall: 0.9034 - Accuracy: 0.9422 - val_loss: 0.1413 - val_recall: 0.9048 - val_Accuracy: 0.9298 Epoch 15/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1534 - recall: 0.8759 - Accuracy: 0.9372 - val_loss: 0.1415 - val_recall: 0.9048 - val_Accuracy: 0.9298 Epoch 16/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1596 - recall: 0.8897 - Accuracy: 0.9347 - val_loss: 0.1406 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 17/20 13/13 [==============================] - 0s 4ms/step - loss: 0.1553 - recall: 0.9103 - Accuracy: 0.9497 - val_loss: 0.1424 - val_recall: 0.9048 - val_Accuracy: 0.9298 Epoch 18/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1565 - recall: 0.9310 - Accuracy: 0.9397 - val_loss: 0.1405 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 19/20 13/13 [==============================] - 0s 5ms/step - loss: 0.1518 - recall: 0.8828 - Accuracy: 0.9447 - val_loss: 0.1404 - val_recall: 0.9048 - val_Accuracy: 0.9474 Epoch 20/20 13/13 [==============================] - 0s 8ms/step - loss: 0.1491 - recall: 0.9103 - Accuracy: 0.9447 - val_loss: 0.1420 - val_recall: 0.9048 - val_Accuracy: 0.9298
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["Accuracy"], label="train_acc")
plt.plot(np.arange(0, n_epoch), history.history["val_Accuracy"], label="val_acc")
plt.title("Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Accuracy")
plt.legend()
<matplotlib.legend.Legend at 0x13855a3a0d0>
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["loss"], label="train_loss")
plt.plot(np.arange(0, n_epoch), history.history["val_loss"], label="val_loss")
plt.title("Training Loss")
plt.xlabel("Epoch #")
plt.ylabel("Loss")
plt.legend()
<matplotlib.legend.Legend at 0x13855a372b0>
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["recall"], label="train_recall")
plt.plot(np.arange(0, n_epoch), history.history["val_recall"], label="val_recall")
plt.title("Training Recall")
plt.xlabel("Epoch #")
plt.ylabel("Recall")
plt.legend()
<matplotlib.legend.Legend at 0x13855aedb20>
y_pred = cnn_model.predict(X_test_selected)
y_pred = [1 if y > 0.5 else 0 for y in y_pred]
test_recall = recall_score(y_test,y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
4/4 [==============================] - 0s 2ms/step
(0.9565217391304348, 0.956140350877193)
insert_results("Convolutional Neural Network", history.history["recall"][-1], history.history["Accuracy"][-1], test_recall, test_accuracy)
results_df
| Index | Model Name | Training Set Recall | Training Set Accuracy | Testing Set Recall | Testing Set Accuracy | |
|---|---|---|---|---|---|---|
| 0 | 1 | K-Nearest Neighbours | 0.937931 | 0.964824 | 0.956522 | 0.964912 |
| 0 | 2 | Support Vector Classifier | 0.944828 | 0.974874 | 0.913043 | 0.95614 |
| 0 | 3 | Random Forest Classifier | 0.958621 | 0.979899 | 0.956522 | 0.947368 |
| 0 | 4 | Naive Bayes Classifier | 0.917241 | 0.942211 | 0.956522 | 0.95614 |
| 0 | 5 | Logistic Regression | 0.931034 | 0.972362 | 0.913043 | 0.95614 |
| 0 | 6 | Single Layer Perceptron | 0.868966 | 0.937186 | 0.934783 | 0.947368 |
| 0 | 7 | Multilayer Perceptron | 0.868966 | 0.937186 | 0.956522 | 0.95614 |
| 0 | 8 | Convolutional Neural Network | 0.910345 | 0.944724 | 0.956522 | 0.95614 |
For the final model, we aim to achieve the highest recall and accuracy when making predictions on unseen data. To do this, we will combine several of the best performing models that we will look at, and derive final predictions by averaging the results. The machine learning models that will be used in the final ensemble model will be: the KNN model, the Naive Bayes classifier, and the multilayer perceptron. These models will be weighted according to their performance in the previous section.
knn_optimised.best_estimator_.get_params()
{'algorithm': 'auto',
'leaf_size': 30,
'metric': 'minkowski',
'metric_params': None,
'n_jobs': None,
'n_neighbors': 3,
'p': 2,
'weights': 'uniform'}
nb_optimised.best_estimator_.get_params()
{'priors': None, 'var_smoothing': 0.0001}
cnn_model.summary()
Model: "sequential_7"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d_5 (Conv1D) (None, 9, 32) 96
dropout_13 (Dropout) (None, 9, 32) 0
flatten_5 (Flatten) (None, 288) 0
dense_16 (Dense) (None, 64) 18496
dropout_14 (Dropout) (None, 64) 0
dense_17 (Dense) (None, 1) 65
=================================================================
Total params: 18,657
Trainable params: 18,657
Non-trainable params: 0
_________________________________________________________________
def averaged_predictions(x, m1, m2, m3, weights):
y_m1 = m1.predict(x)
y_m1 = [weights[0] if y > 0.5 else 1 - weights[0] for y in y_m1]
y_m2 = m2.predict(x)
y_m2 = [weights[1] if y > 0.5 else 1 - weights[1] for y in y_m2]
y_m3 = m3.predict(x)
y_m3 = [weights[2] if y > 0.5 else 1 - weights[2] for y in y_m3]
result = [x + y + z for x, y, z in zip(y_m1, y_m2, y_m3)]
result = [1 if y > sum(weights)/2 else 0 for y in result]
return result
y_pred = averaged_predictions(X_test_selected, knn_optimised, nb_optimised, cnn_model, (1, 0.7, 0.8))
test_recall = recall_score(y_test,y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
4/4 [==============================] - 0s 2ms/step
(0.9782608695652174, 0.9649122807017544)